-
Notifications
You must be signed in to change notification settings - Fork 466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
db: mergingIter.SeekPrefixGE stops early if prefix cannot match #1085
Conversation
Ignore the first commit which is from #1079 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to take a closer look at the testing tomorrow, but this seems relatively small and safe. I like it.
Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @sumeerbhola)
merging_iter.go, line 637 at r2 (raw file):
return nil, nil } }
This is a lot smaller than I expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have sufficient test coverage already that a mistake in this optimization would cause a failure? I suspect yes, but just want to verify.
Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @sumeerbhola)
merging_iter.go, line 637 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This is a lot smaller than I expected.
As an alternative, could you have performed this optimization down in sstable.singleLevelIterator
? In particular, a call to singleLevelIterator.SeekPrefixGE
could save away the prefix as mergingIter
does, and return false from either SeekPrefixGE
or a subsequent Next
when the key no longer has the desired prefix. I haven't thought through whether it is more desirable to do this in mergingIter
vs singleLevelIterator
.
5b352c2
to
4c0a523
Compare
This is motivated by the CockroachDB issues discussed in cockroachdb#1070 where range tombstones in L6 cause the iterator to go through all the deleted data. The situation is even worse in that each successive SeekPrefixGE in a batch request (which is in sorted order) will typically have to start all over again since the file at which the seek finally ended its work is probably later than the file which contains the relevant key. Note that CockroachDB does not set bounds when using SeekPrefixGE because the assumption is that the underlying Pebble implementation can imply bounds, and we don't want to change this behavior in CockroachDB. Since CockroachDB primarily uses range tombstones when deleting CockroachDB ranges, the SeekPrefixGE calls that were slow were probably on a preceding CockroachDB range, hence stopping early, as done in this PR, should fix the issue. If range tombstones were used within the CockroachDB range that is being read using SeekPrefixGE, the seek could land on a deleted key and will need to iterate through all its MVCC versions. Even in that case, the work is bounded by the number of deleted versions of a single MVCC key. The code stops early in prefix iteration mode, both for SeekPrefixGE and Next. BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the existing problem when with-tombstone=true. name old time/op new time/op delta IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16 446ns ± 0% 433ns ± 0% -2.91% IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16 10.3ms ± 0% 0.0ms ± 0% -99.99% IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16 416ns ± 0% 429ns ± 0% +3.12% IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16 10.6ms ± 0% 0.0ms ± 0% -99.99% IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16 414ns ± 0% 437ns ± 0% +5.56% IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16 10.5ms ± 0% 0.0ms ± 0% -99.99% MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16 1.65µs ± 0% 1.75µs ± 0% +5.75% MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16 463ns ± 0% 459ns ± 0% -0.86% MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16 1.61µs ± 0% 1.67µs ± 0% +3.73% MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16 476ns ± 0% 475ns ± 0% -0.21% MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16 1.62µs ± 0% 1.77µs ± 0% +9.26% MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16 513ns ± 0% 525ns ± 0% +2.34% MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16 1.71µs ± 0% 1.84µs ± 0% +7.65% MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16 1.10µs ± 0% 1.16µs ± 0% +5.27% MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16 1.80µs ± 0% 1.86µs ± 0% +2.99% MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16 1.34µs ± 0% 1.20µs ± 0% -10.23% name old alloc/op new alloc/op delta IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16 0.00B 0.00B 0.00% IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16 300B ± 0% 0B -100.00% IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16 0.00B 0.00B 0.00% IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16 300B ± 0% 0B -100.00% IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16 0.00B 0.00B 0.00% IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16 292B ± 0% 0B -100.00% MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16 0.00B 0.00B 0.00% MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16 0.00B 0.00B 0.00% MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16 0.00B 0.00B 0.00% MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16 0.00B 0.00B 0.00% MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16 0.00B 0.00B 0.00% MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16 0.00B 0.00B 0.00% MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16 0.00B 0.00B 0.00% MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16 0.00B 0.00B 0.00% MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16 0.00B 0.00B 0.00% MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16 0.00B 0.00B 0.00% Informs cockroachdb#1070
4c0a523
to
3045305
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Do we have sufficient test coverage already that a mistake in this optimization would cause a failure? I suspect yes, but just want to verify.
I've added a test case to merging_iter_test.go
. And metamorphic testing will exercise this.
Also, I've removed the amortization logic since it was a premature optimization, and have added a comment there mentioning that we could do so, including the compares we are already incurring (which is why I removed the logic).
Reviewable status: 0 of 6 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @petermattis)
merging_iter.go, line 637 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
As an alternative, could you have performed this optimization down in
sstable.singleLevelIterator
? In particular, a call tosingleLevelIterator.SeekPrefixGE
could save away the prefix asmergingIter
does, and return false from eitherSeekPrefixGE
or a subsequentNext
when the key no longer has the desired prefix. I haven't thought through whether it is more desirable to do this inmergingIter
vssingleLevelIterator
.
This is definitely debatable.
Theoretically, one could construct a large memtable with a similar pattern of sets which are range deleted and incur the cost of iterating over 64MB of data for each SeekPrefixGE, though it is unlikely to occur in practice.
In general, I've been making iterator optimizations at the highest layer in the iterator stack as possible, since it reduces the places to touch.
- The next optimizations ended up at the lowest layer because those iterators know whether they are close enough for it to be worthwhile (e.g. same ssblock), and know about bloom filter matching -- this is true for both the SeekPrefixGE, and SeekGE/LT with bounds optimizations (the former passes trySeekUsingNext up from the Iterator as a hint only, and the optimization is done at the lowest layers. I should have done the same for the latter by adding a parameter to SetBounds instead of making each InternalIterator do bounds comparisons).
- The noop optimization for SeekGE,LT was feasible to do in Iterator.
- For this change mergingIter was the highest level where it was possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really glad we didn't need to do anything more invasive than this.
Reviewed 1 of 5 files at r2, 4 of 14 files at r3.
Reviewable status: 2 of 6 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @petermattis)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 of 6 files reviewed, all discussions resolved (waiting on @itsbilal)
merging_iter.go, line 637 at r2 (raw file):
Previously, sumeerbhola wrote…
This is definitely debatable.
Theoretically, one could construct a large memtable with a similar pattern of sets which are range deleted and incur the cost of iterating over 64MB of data for each SeekPrefixGE, though it is unlikely to occur in practice.
In general, I've been making iterator optimizations at the highest layer in the iterator stack as possible, since it reduces the places to touch.
- The next optimizations ended up at the lowest layer because those iterators know whether they are close enough for it to be worthwhile (e.g. same ssblock), and know about bloom filter matching -- this is true for both the SeekPrefixGE, and SeekGE/LT with bounds optimizations (the former passes trySeekUsingNext up from the Iterator as a hint only, and the optimization is done at the lowest layers. I should have done the same for the latter by adding a parameter to SetBounds instead of making each InternalIterator do bounds comparisons).
- The noop optimization for SeekGE,LT was feasible to do in Iterator.
- For this change mergingIter was the highest level where it was possible.
Ok, the approach here SGTM. Thanks for laying out your thinking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
Reviewable status: 2 of 6 files reviewed, all discussions resolved (waiting on @itsbilal)
This is motivated by the CockroachDB issues discussed
in #1070 where
range tombstones in L6 cause the iterator to go through
all the deleted data. The situation is even worse in that
each successive SeekPrefixGE in a batch request (which is
in sorted order) will typically have to start all over again
since the file at which the seek finally ended its work is
probably later than the file which contains the relevant key.
Note that CockroachDB does not set bounds when using
SeekPrefixGE because the assumption is that the underlying
Pebble implementation can imply bounds, and we don't want
to change this behavior in CockroachDB. Since CockroachDB
primarily uses range tombstones when deleting CockroachDB
ranges, the SeekPrefixGE calls that were slow were probably
on a preceding CockroachDB range, hence stopping early,
as done in this PR, should fix the issue. If range tombstones
were used within the CockroachDB range that is being read
using SeekPrefixGE, the seek could land on a deleted key
and will need to iterate through all its MVCC versions.
Even in that case, the work is bounded by the number of
deleted versions of a single MVCC key.
BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the
existing problem when with-tombstone=true.
Informs #1070